Data Description:

The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Domain:

Object recognition

Context:

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:

● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.

Learning Outcomes:

● Exploratory Data Analysis
● Reduce the number of dimensions in the dataset with minimal information loss
● Train a model using the principal components

Objective:

Apply a dimensionality reduction technique (PCA) and train a model using the principal components instead of training the model using the raw data.

Import libraries and read the dataset. (The .dropna() function could drop rows with NAs outright; below, missing values are instead inspected and imputed.)

In [1]:
# Numerical libraries
import numpy as np

# to handle data in form of rows and columns 
import pandas as pd    

# importing plotting libraries
import matplotlib.pyplot as plt   

# importing seaborn for statistical plots
import seaborn as sns


vehicle_df = pd.read_csv('vehicle.csv')

1. Data pre-processing - Understand the data, and treat missing values and outliers (use box plots)

In [2]:
vehicle_df.shape
Out[2]:
(846, 19)
In [3]:
vehicle_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [4]:
vehicle_df.head(10)
Out[4]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 NaN 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 NaN 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car

Print/plot the dependent (categorical) variable and check for any missing values in the data

In [5]:
#Since the variable is categorical, you can use value_counts function
pd.value_counts(vehicle_df['class'])
Out[5]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
pd.value_counts(vehicle_df["class"]).plot(kind="bar")
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x21f37094e10>
In [7]:
vehicle_df.isna().sum()
Out[7]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [8]:
vehicle_df.dtypes
Out[8]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [9]:
# Several null values; we will look at box plots now to check each column's shape, then replace the missing values

for column in vehicle_df.select_dtypes(include=[np.number]):
    plt.figure()
    box_plot = sns.boxplot(data=vehicle_df[column], orient="h")
    box_plot.set(xlabel=column)
In [10]:
vehicle_df_na_removed = vehicle_df.fillna(vehicle_df.median())

vehicle_df_na_removed.head(10)
Out[10]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
5 107 44.0 106.0 172.0 50.0 6 255.0 26.0 28.0 169 280.0 957.0 264.0 85.0 5.0 9.0 181.0 183 bus
6 97 43.0 73.0 173.0 65.0 6 153.0 42.0 19.0 143 176.0 361.0 172.0 66.0 13.0 1.0 200.0 204 bus
7 90 43.0 66.0 157.0 65.0 9 137.0 48.0 18.0 146 162.0 281.0 164.0 67.0 3.0 3.0 193.0 202 van
8 86 34.0 62.0 140.0 61.0 7 122.0 54.0 17.0 127 141.0 223.0 112.0 64.0 2.0 14.0 200.0 208 van
9 93 44.0 98.0 167.0 62.0 11 183.0 36.0 22.0 146 202.0 505.0 152.0 64.0 4.0 14.0 195.0 204 car
In [11]:
# Outlier treatment -- collapse outliers to the 5th and 95th percentiles
vehicle_df_na_removed.columns = ['compactness','circularity','distance_circularity','radius_ratio','pr_axis_aspect_ratio','max_length_aspect_ratio','scatter_ratio','elongatedness','pr_axis_rectangularity','max_length_rectangularity','scaled_variance','scaled_variance_1','scaled_radius_of_gyration','scaled_radius_of_gyration_1','skewness_about','skewness_about_1','skewness_about_2','hollows_ratio','class']
# found something at -- https://www.kaggle.com/general/24617

# A value "a" from the vector "x" is treated as an outlier if
# a > median(x) + 1.5*iqr(x) or a < median(x) - 1.5*iqr(x)
# iqr: interquartile range = third quartile - first quartile
def outliers(x):
    return np.abs(x - x.median()) > 1.5*(x.quantile(0.75) - x.quantile(0.25))

# Replace the largest flagged outlier value(s) with the 95th percentile and the
# smallest with the 5th percentile (note: only the extreme values among the
# flagged outliers are replaced)
def replace(x):   # x is a vector
    out = x[outliers(x)]
    return x.replace(to_replace = [out.min(),out.max()], value = [np.percentile(x,5),np.percentile(x,95)])

vehicle_df_out_normalized = vehicle_df_na_removed.select_dtypes(include=[np.number]).apply(replace,axis=0)
#vehicle_df_out_normalized.to_csv('vehicle_df_out_removed.csv')

vehicle_df_out_normalized.shape
Out[11]:
(846, 18)
In [12]:
#boxplot cleaned data
for column in vehicle_df_out_normalized.select_dtypes(include=[np.number]):
    plt.figure()
    box_plot = sns.boxplot(data=vehicle_df_out_normalized[column], orient="h")
    box_plot.set(xlabel=column)

I now have 2 data sets:

a. vehicle_df_na_removed :: NaN-imputed (but outliers not yet normalized)

b. vehicle_df_out_normalized :: a copy of (a) with outliers additionally normalized to the 5th and 95th percentiles
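
To see the effect of the outlier treatment, one can compare a column's summary statistics before and after (a quick sketch, assuming the cells above have run; radius_ratio is just an example column):

In [ ]:
# Compare the extremes of one column before and after the percentile treatment
print(vehicle_df_na_removed['radius_ratio'].describe())
print(vehicle_df_out_normalized['radius_ratio'].describe())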

Standardize the data

In [13]:
# Since the scales of the attributes are not really known to us, it is wise to standardize
# the data using z-scores before we go for any modelling. The scipy zscore function does this.
In [14]:
interest_df = vehicle_df_out_normalized.copy()
In [15]:
from scipy.stats import zscore
interest_df_z = interest_df.apply(zscore)
In [16]:
interest_df_z.head()
Out[16]:
compactness circularity distance_circularity radius_ratio pr_axis_aspect_ratio max_length_aspect_ratio scatter_ratio elongatedness pr_axis_rectangularity max_length_rectangularity scaled_variance scaled_variance_1 scaled_radius_of_gyration scaled_radius_of_gyration_1 skewness_about skewness_about_1 skewness_about_2 hollows_ratio
0 0.161898 0.545148 0.057177 0.286237 1.401479 0.343390 -0.203152 0.136262 -0.203971 0.772874 -0.398747 -0.338398 0.293092 -0.333970 -0.023785 0.409713 -0.279530 0.183957
1 -0.327266 -0.607428 0.120741 -0.839225 -0.622277 0.111355 -0.596837 0.520519 -0.597717 -0.336583 -0.592269 -0.617900 -0.510230 -0.054367 0.611659 0.182274 0.050062 0.452977
2 1.262519 0.874455 1.519141 1.229192 0.591977 0.343390 1.159604 -1.144597 0.977268 0.703533 1.117171 1.121860 1.405385 0.085434 1.670734 -0.386323 -0.114734 0.049447
3 -0.082684 -0.607428 -0.006386 -0.291703 0.187225 0.111355 -0.748254 0.648605 -0.597717 -0.336583 -0.914805 -0.737687 -1.468037 -1.312579 -0.023785 -0.272604 1.698020 1.529056
4 -1.061013 -0.113467 -0.769150 1.107521 5.583908 10.088858 -0.596837 0.520519 -0.597717 -0.267242 1.697736 -0.646421 0.416681 7.634704 0.611659 -0.158885 -1.433100 -1.699181
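
As a quick sanity check (a sketch, not part of the original run), the z-scored columns should now have mean ~0 and standard deviation ~1:

In [ ]:
# Verify the standardization: means ~0 and standard deviations ~1
print(interest_df_z.mean().abs().max())
print(interest_df_z.std().min(), interest_df_z.std().max())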

2. Understanding the attributes - Find relationships between the different attributes (independent variables) and carefully choose which attributes should be part of the analysis and why (15 points)

In [17]:
sns.pairplot(interest_df_z, diag_kind="kde")
Out[17]:
<seaborn.axisgrid.PairGrid at 0x21f36dfce10>

Observations

1. From the KDE plots on the diagonal we can see there are at least 2 classes, and possibly a 3rd.

2. Several of the independent fields have strong correlations with each other, visible as elongated clouds in multiple places, e.g. scaled_variance vs scatter_ratio and scaled_variance_1 vs scatter_ratio.

3. Some correlations are also slightly non-linear, e.g. elongatedness vs scatter_ratio.

In [18]:
corr = interest_df_z.corr()
print (corr)
                             compactness  circularity  distance_circularity  \
compactness                     1.000000     0.662783              0.790168   
circularity                     0.662783     1.000000              0.767622   
distance_circularity            0.790168     0.767622              1.000000   
radius_ratio                    0.706605     0.610519              0.781705   
pr_axis_aspect_ratio            0.098308     0.164195              0.168678   
max_length_aspect_ratio         0.173223     0.269235              0.289502   
scatter_ratio                   0.808292     0.807296              0.902208   
elongatedness                  -0.789437    -0.793754             -0.911307   
pr_axis_rectangularity          0.788125     0.781698              0.875371   
max_length_rectangularity       0.673540     0.912141              0.765461   
scaled_variance                 0.767337     0.761048              0.863809   
scaled_variance_1               0.809510     0.796378              0.885608   
scaled_radius_of_gyration       0.579977     0.887136              0.700825   
scaled_radius_of_gyration_1    -0.258770     0.040250             -0.237052   
skewness_about                  0.220475     0.150698              0.118794   
skewness_about_1                0.165022     0.012648              0.255432   
skewness_about_2                0.295900    -0.069195              0.173579   
hollows_ratio                   0.364497     0.059880              0.332732   

                             radius_ratio  pr_axis_aspect_ratio  \
compactness                      0.706605              0.098308   
circularity                      0.610519              0.164195   
distance_circularity             0.781705              0.168678   
radius_ratio                     1.000000              0.629557   
pr_axis_aspect_ratio             0.629557              1.000000   
max_length_aspect_ratio          0.415768              0.563521   
scatter_ratio                    0.752688              0.125179   
elongatedness                   -0.806922             -0.200650   
pr_axis_rectangularity           0.721023              0.111625   
max_length_rectangularity        0.569699              0.132958   
scaled_variance                  0.785977              0.251115   
scaled_variance_1                0.736798              0.109819   
scaled_radius_of_gyration        0.539055              0.128430   
scaled_radius_of_gyration_1     -0.256397              0.068536   
skewness_about                   0.046052             -0.085772   
skewness_about_1                 0.174165             -0.024569   
skewness_about_2                 0.392725              0.254693   
hollows_ratio                    0.476252              0.280890   

                             max_length_aspect_ratio  scatter_ratio  \
compactness                                 0.173223       0.808292   
circularity                                 0.269235       0.807296   
distance_circularity                        0.289502       0.902208   
radius_ratio                                0.415768       0.752688   
pr_axis_aspect_ratio                        0.563521       0.125179   
max_length_aspect_ratio                     1.000000       0.189874   
scatter_ratio                               0.189874       1.000000   
elongatedness                              -0.201189      -0.966818   
pr_axis_rectangularity                      0.208085       0.941395   
max_length_rectangularity                   0.320902       0.791820   
scaled_variance                             0.297300       0.944587   
scaled_variance_1                           0.167833       0.980630   
scaled_radius_of_gyration                   0.202060       0.787822   
scaled_radius_of_gyration_1                 0.210138      -0.030583   
skewness_about                              0.021471       0.086557   
skewness_about_1                            0.048197       0.214387   
skewness_about_2                           -0.017835       0.032238   
hollows_ratio                               0.146823       0.126517   

                             elongatedness  pr_axis_rectangularity  \
compactness                      -0.789437                0.788125   
circularity                      -0.793754                0.781698   
distance_circularity             -0.911307                0.875371   
radius_ratio                     -0.806922                0.721023   
pr_axis_aspect_ratio             -0.200650                0.111625   
max_length_aspect_ratio          -0.201189                0.208085   
scatter_ratio                    -0.966818                0.941395   
elongatedness                     1.000000               -0.922035   
pr_axis_rectangularity           -0.922035                1.000000   
max_length_rectangularity        -0.766394                0.781235   
scaled_variance                  -0.940369                0.886891   
scaled_variance_1                -0.950315                0.935498   
scaled_radius_of_gyration        -0.761814                0.753256   
scaled_radius_of_gyration_1       0.105365               -0.053871   
skewness_about                   -0.065772                0.090528   
skewness_about_1                 -0.182031                0.232018   
skewness_about_2                 -0.134580                0.033681   
hollows_ratio                    -0.216905                0.144074   

                             max_length_rectangularity  scaled_variance  \
compactness                                   0.673540         0.767337   
circularity                                   0.912141         0.761048   
distance_circularity                          0.765461         0.863809   
radius_ratio                                  0.569699         0.785977   
pr_axis_aspect_ratio                          0.132958         0.251115   
max_length_aspect_ratio                       0.320902         0.297300   
scatter_ratio                                 0.791820         0.944587   
elongatedness                                -0.766394        -0.940369   
pr_axis_rectangularity                        0.781235         0.886891   
max_length_rectangularity                     1.000000         0.738173   
scaled_variance                               0.738173         1.000000   
scaled_variance_1                             0.781045         0.939019   
scaled_radius_of_gyration                     0.857664         0.769948   
scaled_radius_of_gyration_1                   0.041162         0.073289   
skewness_about                                0.141473         0.051598   
skewness_about_1                              0.016894         0.197115   
skewness_about_2                             -0.081138         0.039643   
hollows_ratio                                 0.075213         0.092573   

                             scaled_variance_1  scaled_radius_of_gyration  \
compactness                           0.809510                   0.579977   
circularity                           0.796378                   0.887136   
distance_circularity                  0.885608                   0.700825   
radius_ratio                          0.736798                   0.539055   
pr_axis_aspect_ratio                  0.109819                   0.128430   
max_length_aspect_ratio               0.167833                   0.202060   
scatter_ratio                         0.980630                   0.787822   
elongatedness                        -0.950315                  -0.761814   
pr_axis_rectangularity                0.935498                   0.753256   
max_length_rectangularity             0.781045                   0.857664   
scaled_variance                       0.939019                   0.769948   
scaled_variance_1                     1.000000                   0.784171   
scaled_radius_of_gyration             0.784171                   1.000000   
scaled_radius_of_gyration_1          -0.020549                   0.195868   
skewness_about                        0.094904                   0.173358   
skewness_about_1                      0.201150                  -0.041861   
skewness_about_2                      0.031501                  -0.202662   
hollows_ratio                         0.111671                  -0.116379   

                             scaled_radius_of_gyration_1  skewness_about  \
compactness                                    -0.258770        0.220475   
circularity                                     0.040250        0.150698   
distance_circularity                           -0.237052        0.118794   
radius_ratio                                   -0.256397        0.046052   
pr_axis_aspect_ratio                            0.068536       -0.085772   
max_length_aspect_ratio                         0.210138        0.021471   
scatter_ratio                                  -0.030583        0.086557   
elongatedness                                   0.105365       -0.065772   
pr_axis_rectangularity                         -0.053871        0.090528   
max_length_rectangularity                       0.041162        0.141473   
scaled_variance                                 0.073289        0.051598   
scaled_variance_1                              -0.020549        0.094904   
scaled_radius_of_gyration                       0.195868        0.173358   
scaled_radius_of_gyration_1                     1.000000       -0.076426   
skewness_about                                 -0.076426        1.000000   
skewness_about_1                               -0.109439       -0.014513   
skewness_about_2                               -0.747839        0.077130   
hollows_ratio                                  -0.836164        0.073381   

                             skewness_about_1  skewness_about_2  hollows_ratio  
compactness                          0.165022          0.295900       0.364497  
circularity                          0.012648         -0.069195       0.059880  
distance_circularity                 0.255432          0.173579       0.332732  
radius_ratio                         0.174165          0.392725       0.476252  
pr_axis_aspect_ratio                -0.024569          0.254693       0.280890  
max_length_aspect_ratio              0.048197         -0.017835       0.146823  
scatter_ratio                        0.214387          0.032238       0.126517  
elongatedness                       -0.182031         -0.134580      -0.216905  
pr_axis_rectangularity               0.232018          0.033681       0.144074  
max_length_rectangularity            0.016894         -0.081138       0.075213  
scaled_variance                      0.197115          0.039643       0.092573  
scaled_variance_1                    0.201150          0.031501       0.111671  
scaled_radius_of_gyration           -0.041861         -0.202662      -0.116379  
scaled_radius_of_gyration_1         -0.109439         -0.747839      -0.836164  
skewness_about                      -0.014513          0.077130       0.073381  
skewness_about_1                     1.000000          0.071884       0.187978  
skewness_about_2                     0.071884          1.000000       0.843624  
hollows_ratio                        0.187978          0.843624       1.000000  

Observation: high correlation found; in many places the absolute correlation exceeds 0.5.
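
To make the strongly related pairs explicit, here is a short sketch (an addition, not in the original run; the 0.8 cutoff is an arbitrary choice) that lists attribute pairs from the matrix above with absolute correlation above 0.8:

In [ ]:
# Keep only the upper triangle so each pair appears once, then filter
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])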

In [19]:
# plot the heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x21f44526048>
In [20]:
# ref ::https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import statsmodels.api as sm

def variance_inflation_factors(exog_df):
    '''
    Parameters
    ----------
    exog_df : dataframe, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression.

    Returns
    -------
    vif : Series
        variance inflation factors
    '''
    exog_df = add_constant(exog_df)
    vifs = pd.Series(
        [1 / (1. - sm.OLS(exog_df[col].values, 
                       exog_df.loc[:, exog_df.columns != col].values).fit().rsquared) 
         for col in exog_df],
        index=exog_df.columns,
        name='VIF'
    )
    return vifs
In [21]:
df_tmp = interest_df_z.copy()
variance_inflation_factors(df_tmp)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
Out[21]:
const                           1.000000
compactness                     5.050814
circularity                     8.723132
distance_circularity           11.042702
radius_ratio                   20.636286
pr_axis_aspect_ratio            7.710081
max_length_aspect_ratio         2.758036
scatter_ratio                  47.617249
elongatedness                  27.723714
pr_axis_rectangularity         10.342068
max_length_rectangularity       8.477741
scaled_variance                19.833070
scaled_variance_1              30.805092
scaled_radius_of_gyration       7.797358
scaled_radius_of_gyration_1    10.025446
skewness_about                  1.203265
skewness_about_1                1.420836
skewness_about_2                5.034288
hollows_ratio                   9.929013
Name: VIF, dtype: float64

VIF Guide :: A value of 1 means that the predictor is not correlated with the other variables. The higher the value, the greater the correlation of the variable with the others. Values of more than 4 or 5 are sometimes regarded as moderate to high, and values of 10 or more as very high.

Based on this we can see that distance_circularity, radius_ratio, scatter_ratio, elongatedness, pr_axis_rectangularity, scaled_variance, scaled_variance_1, scaled_radius_of_gyration_1 and hollows_ratio (all with VIFs near or above 10) will cause multicollinearity problems.
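
One way to act on this (a sketch under the assumption that we prune columns by VIF rather than use PCA; drop_high_vif and the threshold of 10 are my own choices) is to iteratively drop the column with the highest VIF until all remaining values fall below the threshold:

In [ ]:
# Iteratively remove the worst-VIF column until all VIFs are below threshold
def drop_high_vif(df, threshold=10.0):
    df = df.copy()
    while True:
        vifs = variance_inflation_factors(df).drop('const')
        if vifs.max() < threshold:
            return df
        df = df.drop(columns=[vifs.idxmax()])

# reduced_df = drop_high_vif(interest_df_z)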

3. Use PCA from scikit learn and elbow plot to find out reduced number of dimension (which covers more than 95% of the variance) - 20 points

In [22]:
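# Note: since interest_df_z is z-scored, this covariance matrix is effectively
# the correlation matrix (the 1.00118 diagonal is n/(n-1) = 846/845, because
# np.cov normalizes by n-1 while zscore normalized by n)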
covMatrix = np.cov(interest_df_z,rowvar=False)
print(covMatrix)
[[ 1.00118343  0.66356774  0.79110298  0.70744134  0.0984242   0.17342849
   0.80924881 -0.79037098  0.78905754  0.67433692  0.76824504  0.81046833
   0.58066335 -0.25907595  0.22073624  0.16521738  0.29625041  0.36492809]
 [ 0.66356774  1.00118343  0.7685308   0.61124128  0.16438885  0.26955317
   0.80825164 -0.79469345  0.78262269  0.91322001  0.76194906  0.79732
   0.88818587  0.04029727  0.15087664  0.01266283 -0.06927723  0.05995085]
 [ 0.79110298  0.7685308   1.00118343  0.78263059  0.16887726  0.28984455
   0.90327561 -0.9123854   0.87640715  0.76636644  0.86483131  0.88665588
   0.70165421 -0.23733261  0.11893421  0.25573424  0.17378401  0.33312625]
 [ 0.70744134  0.61124128  0.78263059  1.00118343  0.63030217  0.4162599
   0.75357871 -0.80787717  0.72187654  0.57037288  0.78690679  0.73766994
   0.539693   -0.25670052  0.04610698  0.17437098  0.39318942  0.47681528]
 [ 0.0984242   0.16438885  0.16887726  0.63030217  1.00118343  0.56418739
   0.12532674 -0.20088738  0.11175737  0.13311494  0.25141216  0.10994874
   0.12858229  0.06861722 -0.08587331 -0.02459789  0.25499397  0.2812229 ]
 [ 0.17342849  0.26955317  0.28984455  0.4162599   0.56418739  1.00118343
   0.19009909 -0.20142742  0.20833115  0.32128128  0.297652    0.16803154
   0.20229951  0.21038718  0.02149684  0.0482536  -0.017856    0.14699635]
 [ 0.80924881  0.80825164  0.90327561  0.75357871  0.12532674  0.19009909
   1.00118343 -0.96796252  0.94250898  0.792757    0.94570503  0.98179072
   0.78875467 -0.03061914  0.08665981  0.21464055  0.03227665  0.12666681]
 [-0.79037098 -0.79469345 -0.9123854  -0.80787717 -0.20088738 -0.20142742
  -0.96796252  1.00118343 -0.9231258  -0.7673013  -0.941482   -0.95143999
  -0.76271562  0.1054898  -0.0658496  -0.18224642 -0.13473917 -0.2171615 ]
 [ 0.78905754  0.78262269  0.87640715  0.72187654  0.11175737  0.20833115
   0.94250898 -0.9231258   1.00118343  0.78215964  0.88794028  0.9366051
   0.75414734 -0.0539343   0.09063485  0.2322923   0.03372111  0.14424404]
 [ 0.67433692  0.91322001  0.76636644  0.57037288  0.13311494  0.32128128
   0.792757   -0.7673013   0.78215964  1.00118343  0.73904626  0.78196953
   0.85867936  0.04121116  0.14164054  0.01691447 -0.08123418  0.07530189]
 [ 0.76824504  0.76194906  0.86483131  0.78690679  0.25141216  0.297652
   0.94570503 -0.941482    0.88794028  0.73904626  1.00118343  0.94013031
   0.77085955  0.07337608  0.05165907  0.19734855  0.03968994  0.09268241]
 [ 0.81046833  0.79732     0.88665588  0.73766994  0.10994874  0.16803154
   0.98179072 -0.95143999  0.9366051   0.78196953  0.94013031  1.00118343
   0.78509875 -0.02057356  0.09501634  0.20138791  0.03153818  0.11180287]
 [ 0.58066335  0.88818587  0.70165421  0.539693    0.12858229  0.20229951
   0.78875467 -0.76271562  0.75414734  0.85867936  0.77085955  0.78509875
   1.00118343  0.19609975  0.17356287 -0.04191014 -0.20290225 -0.11651633]
 [-0.25907595  0.04029727 -0.23733261 -0.25670052  0.06861722  0.21038718
  -0.03061914  0.1054898  -0.0539343   0.04121116  0.07337608 -0.02057356
   0.19609975  1.00118343 -0.07651616 -0.10956849 -0.74872416 -0.83715313]
 [ 0.22073624  0.15087664  0.11893421  0.04610698 -0.08587331  0.02149684
   0.08665981 -0.0658496   0.09063485  0.14164054  0.05165907  0.09501634
   0.17356287 -0.07651616  1.00118343 -0.01453032  0.07722161  0.0734676 ]
 [ 0.16521738  0.01266283  0.25573424  0.17437098 -0.02459789  0.0482536
   0.21464055 -0.18224642  0.2322923   0.01691447  0.19734855  0.20138791
  -0.04191014 -0.10956849 -0.01453032  1.00118343  0.07196949  0.18820016]
 [ 0.29625041 -0.06927723  0.17378401  0.39318942  0.25499397 -0.017856
   0.03227665 -0.13473917  0.03372111 -0.08123418  0.03968994  0.03153818
  -0.20290225 -0.74872416  0.07722161  0.07196949  1.00118343  0.84462258]
 [ 0.36492809  0.05995085  0.33312625  0.47681528  0.2812229   0.14699635
   0.12666681 -0.2171615   0.14424404  0.07530189  0.09268241  0.11180287
  -0.11651633 -0.83715313  0.0734676   0.18820016  0.84462258  1.00118343]]
In [23]:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
pca.fit(interest_df_z)
Out[23]:
PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [24]:
print(pca.explained_variance_)
[9.30942553 2.98695576 1.70805308 1.14250309 0.93839322 0.5693209
 0.40500873 0.24311424 0.18360987 0.11124655]
In [25]:
print(pca.components_)
[[ 2.77750911e-01  2.84746580e-01  3.07250669e-01  2.74290294e-01
   8.48577180e-02  1.03676054e-01  3.16426226e-01 -3.16038605e-01
   3.07339452e-01  2.80502824e-01  3.08662504e-01  3.13130918e-01
   2.69312212e-01 -3.18997616e-02  4.42254916e-02  6.19060456e-02
   4.23389505e-02  8.11774418e-02]
 [-1.10939869e-01  1.27915077e-01 -6.27660376e-02 -1.92949928e-01
  -1.43927013e-01  1.48610878e-03  5.70726926e-02  4.15607066e-04
   4.65316862e-02  1.27841282e-01  6.72366887e-02  6.26419515e-02
   2.22292879e-01  5.01101667e-01 -2.56114248e-02 -1.07997018e-01
  -5.28757130e-01 -5.35548676e-01]
 [-1.36438976e-01 -8.02258341e-03 -5.90378304e-02  2.47247955e-01
   6.53658778e-01  5.89563930e-01 -9.27878164e-02  5.39622829e-02
  -9.50066687e-02 -5.05216571e-03  3.31698726e-02 -1.06079352e-01
  -1.86149009e-02  2.61837799e-01 -1.71065605e-01 -1.01182187e-01
  -1.13322682e-02  1.96466331e-02]
 [ 6.43626738e-02  1.90498499e-01 -6.16965263e-02 -4.02074563e-02
   2.75026849e-02  5.10947609e-02 -9.86379455e-02  7.82082772e-02
  -1.01840041e-01  1.88125181e-01 -1.19618710e-01 -8.97071865e-02
   1.98050540e-01 -8.28584694e-02  5.95293623e-01 -6.72794964e-01
   1.01305748e-01  5.85932155e-02]
 [ 4.95569174e-02 -6.75927417e-02  3.32795133e-02 -3.93872407e-02
  -2.19304843e-02  2.87520586e-01 -2.95676426e-02  8.44731812e-02
   5.19107246e-05 -4.35008880e-02 -1.73481856e-02 -3.34137760e-02
  -6.78608589e-02  1.24277920e-01  7.14033974e-01  5.97402988e-01
  -9.51865966e-02 -1.50083520e-03]
 [-7.49431389e-02  2.85575713e-01  1.23300406e-01 -2.60572458e-01
  -3.27573137e-01  4.65760108e-01 -1.13496362e-01  1.43338266e-01
  -3.34658549e-02  4.15292810e-01 -2.17739038e-01 -1.44627685e-01
   8.71408914e-02 -2.37203285e-01 -2.69771554e-01  1.27401256e-01
  -1.45070286e-01  2.37144453e-01]
 [ 3.73519524e-01 -3.45356238e-01  8.99190919e-02 -8.28336270e-02
  -3.80671316e-01  4.74010042e-01  9.22094985e-02 -4.14295048e-02
   1.08639218e-01 -2.03587616e-01  1.53666924e-01  1.08927291e-01
  -3.68129842e-01  9.20247435e-02 -3.63879637e-02 -3.32539930e-01
   9.59672630e-03 -4.16415552e-02]
 [-6.07857729e-01 -1.54006350e-01  3.61853234e-01  1.48400140e-01
  -2.70639626e-02  8.39997127e-02  9.08196906e-02 -1.81050372e-01
   1.06652911e-01 -2.29762941e-01  1.00091521e-02  4.43924408e-02
  -3.74734767e-02 -3.72190530e-01  1.40190351e-01 -1.59412346e-01
  -3.91665441e-01  1.95082862e-02]
 [ 5.19269142e-01 -4.32250980e-03  1.90324457e-02  2.09878013e-01
   2.50198335e-01 -1.63517085e-01 -4.85873445e-02  1.66693247e-01
   5.21318718e-02  6.06207062e-02 -2.59144071e-01 -5.57723243e-02
  -2.15899629e-01 -2.62493662e-01 -1.90752454e-02 -1.96234913e-02
  -6.11132925e-01 -6.21876213e-03]
 [ 2.37894241e-01  3.99178258e-02  2.50959073e-01  1.25880211e-01
  -1.22422001e-01  6.51023311e-02 -7.79254505e-02  5.52091957e-02
  -6.48361267e-01 -3.55307193e-01  2.38698852e-01 -9.46306764e-02
   4.34118528e-01 -1.19878867e-01 -6.99381012e-02  5.27975329e-02
  -7.71422722e-02 -7.50559493e-02]]
In [26]:
print(pca.explained_variance_ratio_)
[0.51657897 0.16574584 0.09477967 0.06339737 0.05207133 0.03159155
 0.02247389 0.01349038 0.01018849 0.00617306]
In [27]:
plt.bar(list(range(1,11)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()
In [28]:
plt.step(list(range(1,11)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative of variation explained')
plt.xlabel('Principal component')
plt.show()
In [29]:
# 95 percent variance quantification
print(pca.explained_variance_ratio_.cumsum())
[0.51657897 0.68232481 0.77710448 0.84050185 0.89257318 0.92416474
 0.94663862 0.96012901 0.9703175  0.97649055]

It looks like we need 8 components to capture 95% of the variance.
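
As a cross-check (a sketch, not in the original run), scikit-learn can pick the component count directly when n_components is a float between 0 and 1:

In [ ]:
# Let PCA choose the smallest number of components explaining >= 95% variance
pca_auto = PCA(n_components=0.95)
pca_auto.fit(interest_df_z)
print(pca_auto.n_components_)   # expected to agree with the manual choice of 8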

In [30]:
# Let's fit the PCA with the optimal number of components
pca_final = PCA(n_components=8)
pca_final.fit(interest_df_z)
print(pca_final.components_)
print(pca_final.explained_variance_ratio_)
interest_df_z_pca = pca_final.transform(interest_df_z)
interest_df_z_pca.shape
[[ 2.77750911e-01  2.84746580e-01  3.07250669e-01  2.74290294e-01
   8.48577180e-02  1.03676054e-01  3.16426226e-01 -3.16038605e-01
   3.07339452e-01  2.80502824e-01  3.08662504e-01  3.13130918e-01
   2.69312212e-01 -3.18997616e-02  4.42254916e-02  6.19060456e-02
   4.23389505e-02  8.11774418e-02]
 [-1.10939869e-01  1.27915077e-01 -6.27660376e-02 -1.92949928e-01
  -1.43927013e-01  1.48610878e-03  5.70726926e-02  4.15607066e-04
   4.65316862e-02  1.27841282e-01  6.72366887e-02  6.26419515e-02
   2.22292879e-01  5.01101667e-01 -2.56114248e-02 -1.07997018e-01
  -5.28757130e-01 -5.35548676e-01]
 [-1.36438976e-01 -8.02258341e-03 -5.90378304e-02  2.47247955e-01
   6.53658778e-01  5.89563930e-01 -9.27878164e-02  5.39622829e-02
  -9.50066687e-02 -5.05216571e-03  3.31698726e-02 -1.06079352e-01
  -1.86149009e-02  2.61837799e-01 -1.71065605e-01 -1.01182187e-01
  -1.13322682e-02  1.96466331e-02]
 [ 6.43626738e-02  1.90498499e-01 -6.16965263e-02 -4.02074563e-02
   2.75026849e-02  5.10947609e-02 -9.86379455e-02  7.82082772e-02
  -1.01840041e-01  1.88125181e-01 -1.19618710e-01 -8.97071865e-02
   1.98050540e-01 -8.28584694e-02  5.95293623e-01 -6.72794964e-01
   1.01305748e-01  5.85932155e-02]
 [ 4.95569174e-02 -6.75927417e-02  3.32795133e-02 -3.93872407e-02
  -2.19304843e-02  2.87520586e-01 -2.95676426e-02  8.44731812e-02
   5.19107246e-05 -4.35008880e-02 -1.73481856e-02 -3.34137760e-02
  -6.78608589e-02  1.24277920e-01  7.14033974e-01  5.97402988e-01
  -9.51865966e-02 -1.50083520e-03]
 [-7.49431389e-02  2.85575713e-01  1.23300406e-01 -2.60572458e-01
  -3.27573137e-01  4.65760108e-01 -1.13496362e-01  1.43338266e-01
  -3.34658549e-02  4.15292810e-01 -2.17739038e-01 -1.44627685e-01
   8.71408914e-02 -2.37203285e-01 -2.69771554e-01  1.27401256e-01
  -1.45070286e-01  2.37144453e-01]
 [ 3.73519524e-01 -3.45356238e-01  8.99190919e-02 -8.28336270e-02
  -3.80671316e-01  4.74010042e-01  9.22094985e-02 -4.14295048e-02
   1.08639218e-01 -2.03587616e-01  1.53666924e-01  1.08927291e-01
  -3.68129842e-01  9.20247435e-02 -3.63879637e-02 -3.32539930e-01
   9.59672630e-03 -4.16415552e-02]
 [-6.07857729e-01 -1.54006350e-01  3.61853234e-01  1.48400140e-01
  -2.70639626e-02  8.39997127e-02  9.08196906e-02 -1.81050372e-01
   1.06652911e-01 -2.29762941e-01  1.00091521e-02  4.43924408e-02
  -3.74734767e-02 -3.72190530e-01  1.40190351e-01 -1.59412346e-01
  -3.91665441e-01  1.95082862e-02]]
[0.51657897 0.16574584 0.09477967 0.06339737 0.05207133 0.03159155
 0.02247389 0.01349038]
Out[30]:
(846, 8)
In [31]:
sns.pairplot(pd.DataFrame(interest_df_z_pca), diag_kind='kde')
Out[31]:
<seaborn.axisgrid.PairGrid at 0x21f4a872860>

2 frames of interest now ::

  1. interest_df_z_pca -- with PCA
  2. interest_df_z -- without PCA
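
To quantify how little information the 8 components lose, here is a short sketch (an addition, not in the original run) that reconstructs the standardized data from the components and measures the mean squared error:

In [ ]:
# Reconstruct the 18 standardized columns from the 8 principal components
reconstructed = pca_final.inverse_transform(interest_df_z_pca)
mse = np.mean((interest_df_z.values - reconstructed) ** 2)
print(mse)   # small value => little information lost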

4. Use support vector machines to classify the vehicle class (y) and find the difference in accuracy with and without PCA on the predictors (X). 20 points

In [32]:
y = vehicle_df['class'].values.ravel()
In [33]:
##Split into training and test set
from sklearn.model_selection import train_test_split

X_PCA_train, X_PCA_test, y_train, y_test = train_test_split(interest_df_z_pca, y, test_size=0.30, random_state=1)
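
(Side note, an assumption on my part and not used below: passing stratify=y keeps the car/bus/van proportions identical in train and test, which can matter with imbalanced classes like ours. The Xs_*/ys_* names are hypothetical.)

In [ ]:
# Hypothetical stratified alternative to the split above (not used below)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    interest_df_z_pca, y, test_size=0.30, random_state=1, stratify=y)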

With PCA

In [34]:
from sklearn import svm
clf_PCA = svm.SVC(gamma=0.025, C=3)
clf_PCA.fit(X_PCA_train,y_train)
pred_train = clf_PCA.predict(X_PCA_train)
In [35]:
clf_PCA.score(X_PCA_train, y_train)
Out[35]:
0.9577702702702703
In [36]:
pred_test = clf_PCA.predict(X_PCA_test)
clf_PCA.score(X_PCA_test, y_test)
Out[36]:
0.9409448818897638
In [37]:
from sklearn.metrics import classification_report,confusion_matrix
mat_train = confusion_matrix(y_train,pred_train)
print("Train set confusion matrix = \n",mat_train)
Train set confusion matrix = 
 [[148   6   5]
 [  3 288   5]
 [  0   6 131]]
In [38]:
mat_test = confusion_matrix(y_test,pred_test)
print("Test set confusion matrix = \n",mat_test)
Test set confusion matrix = 
 [[ 57   1   1]
 [  2 127   4]
 [  4   3  55]]
In [39]:
from sklearn import metrics
print("SVM Metrics = \n", metrics.classification_report(y_test, pred_test))
SVM Metrics = 
               precision    recall  f1-score   support

         bus       0.90      0.97      0.93        59
         car       0.97      0.95      0.96       133
         van       0.92      0.89      0.90        62

   micro avg       0.94      0.94      0.94       254
   macro avg       0.93      0.94      0.93       254
weighted avg       0.94      0.94      0.94       254

In [40]:
##Split into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(interest_df_z, y, test_size=0.30, random_state=1)

Without PCA

In [41]:
from sklearn import svm
clf = svm.SVC(gamma=0.025, C=3)
clf.fit(X_train,y_train)
pred_train = clf.predict(X_train)
In [42]:
clf.score(X_train, y_train)
Out[42]:
0.9797297297297297
In [43]:
pred_test = clf.predict(X_test)
clf.score(X_test, y_test)
Out[43]:
0.9488188976377953
In [44]:
mat_train = confusion_matrix(y_train,pred_train)
print("Train set confusion matrix = \n",mat_train)
Train set confusion matrix = 
 [[156   0   3]
 [  1 292   3]
 [  0   5 132]]
In [45]:
mat_test = confusion_matrix(y_test,pred_test)
print("Test set confusion matrix = \n",mat_test)
Test set confusion matrix = 
 [[ 57   1   1]
 [  2 128   3]
 [  5   1  56]]
In [46]:
print("SVM Metrics = \n", metrics.classification_report(y_test, pred_test))
SVM Metrics = 
               precision    recall  f1-score   support

         bus       0.89      0.97      0.93        59
         car       0.98      0.96      0.97       133
         van       0.93      0.90      0.92        62

   micro avg       0.95      0.95      0.95       254
   macro avg       0.94      0.94      0.94       254
weighted avg       0.95      0.95      0.95       254

Comparison

Model           Dimensions   Training Accuracy   Testing Accuracy
SVM - PCA       8            0.9577              0.9409
SVM - Non-PCA   18           0.9797              0.9488

Per-class test metrics (precision / recall / f1-score / support):

SVM - PCA     :: bus 0.90 / 0.97 / 0.93 / 59; car 0.97 / 0.95 / 0.96 / 133; van 0.92 / 0.89 / 0.90 / 62
SVM - Non-PCA :: bus 0.89 / 0.97 / 0.93 / 59; car 0.98 / 0.96 / 0.97 / 133; van 0.93 / 0.90 / 0.92 / 62

Summary :: Overall, considering that we have dropped 10 of the 18 dimensions, the accuracy with PCA is remarkable. Also, the training and testing accuracies are close for the PCA model, which probably indicates that we will get good results in production as well.

5. Optional - Use grid search (try C values 0.01, 0.05, 0.5, 1 and kernel = linear, rbf), find the best hyperparameters, and use cross-validation to find the accuracy

We will take the PCA data set as our base for further analysis.

In [47]:
# Ref :: https://medium.com/@aneesha/svm-parameter-tuning-in-scikit-learn-using-gridsearchcv-2413c02125a0
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, make_scorer
#Grid Search
from sklearn.model_selection import GridSearchCV
def svc_param_selection(X, y, nfolds):
    Cs = [0.01, 0.05, 0.5, 1]
    gammas = [0.001, 0.01, 0.1, 1]
    kernels = ['linear', 'rbf']
    param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels}
    # https://stackoverflow.com/questions/50752553/gridsearchcv-for-the-multi-class-svm-in-python
    my_scorer = make_scorer(accuracy_score, greater_is_better=True)
    grid_search = GridSearchCV(svm.SVC(), param_grid, cv=nfolds, scoring = my_scorer)
    grid_search.fit(X, y)
 
    return grid_search
In [48]:
grid_clf_acc = svc_param_selection(X_PCA_train, y_train, 8)
print (grid_clf_acc.best_params_)
{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
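
The mean cross-validated accuracy of the best parameter combination is also available from the fitted grid search (a sketch, not printed in the original run):

In [ ]:
# Mean CV accuracy of the best hyperparameter combination (8 folds above)
print(grid_clf_acc.best_score_)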
In [49]:
grid_clf_acc.score(X_PCA_train, y_train)
Out[49]:
0.9712837837837838
In [50]:
#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_PCA_test)
grid_clf_acc.score(X_PCA_test, y_test)
Out[50]:
0.9330708661417323
In [51]:
print("SVM Metrics = \n", metrics.classification_report(y_test, y_pred_acc))
SVM Metrics = 
               precision    recall  f1-score   support

         bus       0.94      0.98      0.96        59
         car       0.95      0.94      0.94       133
         van       0.90      0.87      0.89        62

   micro avg       0.93      0.93      0.93       254
   macro avg       0.93      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254

In [52]:
#SVM (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)
Out[52]:
array([[ 58,   0,   1],
       [  3, 125,   5],
       [  1,   7,  54]], dtype=int64)

Try with the non-PCA data also

In [53]:
grid_clf_acc_Non_PCA = svc_param_selection(X_train, y_train, 8)
print (grid_clf_acc_Non_PCA.best_params_)
{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
In [54]:
grid_clf_acc_Non_PCA.score(X_train, y_train)
Out[54]:
0.9831081081081081
In [55]:
#Predict values based on new parameters
y_pred_acc = grid_clf_acc_Non_PCA.predict(X_test)
grid_clf_acc_Non_PCA.score(X_test, y_test)
Out[55]:
0.937007874015748
In [56]:
print("SVM Metrics = \n", metrics.classification_report(y_test, y_pred_acc))
SVM Metrics = 
               precision    recall  f1-score   support

         bus       0.89      0.98      0.94        59
         car       0.98      0.95      0.96       133
         van       0.90      0.87      0.89        62

   micro avg       0.94      0.94      0.94       254
   macro avg       0.92      0.93      0.93       254
weighted avg       0.94      0.94      0.94       254

Summary

With SVM-based grid search on both the PCA and non-PCA data, the best combination found is :: {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}. Note that the scorer used optimizes for accuracy.

Model              Training Accuracy   Testing Accuracy
Grid SVM PCA       0.97128             0.93307
Grid SVM Non-PCA   0.98310             0.93700

Per-class test metrics (precision / recall / f1-score / support):

Grid SVM PCA     :: bus 0.94 / 0.98 / 0.96 / 59; car 0.95 / 0.94 / 0.94 / 133; van 0.90 / 0.87 / 0.89 / 62
Grid SVM Non-PCA :: bus 0.89 / 0.98 / 0.94 / 59; car 0.98 / 0.95 / 0.96 / 133; van 0.90 / 0.87 / 0.89 / 62

Additionally, trying hierarchical clustering (unsupervised)

In [57]:
# Generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(interest_df_z, 'ward', metric='euclidean')
Z.shape
Out[57]:
(845, 4)
In [58]:
Z[:]
Out[58]:
array([[1.37000000e+02, 3.99000000e+02, 1.33314777e-01, 2.00000000e+00],
       [5.10000000e+02, 7.85000000e+02, 3.99162559e-01, 2.00000000e+00],
       [4.79000000e+02, 5.58000000e+02, 5.54358879e-01, 2.00000000e+00],
       ...,
       [1.67200000e+03, 1.68600000e+03, 4.72310224e+01, 2.85000000e+02],
       [1.68500000e+03, 1.68700000e+03, 5.48551354e+01, 5.61000000e+02],
       [1.68800000e+03, 1.68900000e+03, 1.10395477e+02, 8.46000000e+02]])
In [59]:
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
In [60]:
# Hint: Use the truncate_mode='lastp' attribute in the dendrogram function (since we know there are 3 types of vehicles)
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [61]:
# Cut the dendrogram at distance 50: this sits between the last two merge
# heights in Z (~47 and ~55), so it yields 3 flat clusters, matching the 3 vehicle types
max_d = 50
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[61]:
array([3, 3, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 1, 3, 2, 1, 1, 3, 3,
       2, 3, 1, 3, 3, 1, 1, 3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 3, 1, 3, 3, 2,
       1, 3, 3, 3, 3, 2, 3, 2, 1, 3, 1, 2, 2, 3, 1, 3, 1, 3, 3, 3, 1, 3,
       3, 1, 2, 1, 1, 1, 2, 3, 3, 1, 2, 3, 1, 3, 3, 1, 3, 3, 2, 1, 3, 3,
       2, 3, 1, 2, 1, 3, 3, 1, 2, 3, 1, 3, 1, 2, 2, 2, 1, 1, 1, 2, 3, 1,
       2, 3, 3, 3, 3, 3, 1, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2, 1, 1, 2,
       1, 3, 1, 1, 3, 2, 3, 2, 2, 3, 1, 3, 2, 1, 3, 2, 2, 2, 1, 2, 2, 1,
       2, 1, 2, 3, 2, 2, 3, 1, 2, 3, 1, 1, 2, 1, 3, 3, 1, 1, 2, 1, 3, 2,
       2, 3, 2, 3, 1, 3, 2, 3, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 1, 3,
       3, 3, 3, 2, 1, 1, 2, 3, 2, 3, 3, 1, 2, 3, 2, 1, 3, 2, 3, 1, 3, 2,
       1, 3, 1, 3, 3, 2, 1, 2, 1, 3, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 2, 2,
       3, 1, 3, 3, 2, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 2, 1, 1, 3, 2, 2, 2,
       1, 3, 3, 2, 2, 3, 3, 1, 2, 2, 1, 2, 3, 3, 1, 2, 2, 3, 3, 1, 3, 2,
       2, 3, 1, 3, 3, 1, 2, 2, 1, 2, 1, 3, 1, 2, 1, 2, 2, 2, 3, 3, 1, 1,
       1, 1, 1, 3, 2, 1, 3, 3, 3, 1, 3, 1, 1, 1, 3, 1, 2, 3, 1, 3, 2, 2,
       2, 1, 1, 3, 1, 1, 3, 1, 2, 2, 2, 3, 3, 1, 1, 1, 1, 3, 2, 2, 1, 3,
       3, 3, 2, 2, 2, 1, 3, 1, 1, 1, 2, 3, 3, 1, 3, 3, 3, 3, 2, 2, 2, 2,
       3, 1, 1, 3, 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 1, 3, 3, 2, 3, 1, 2,
       1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 2,
       3, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 3, 3, 1, 2, 3, 3, 1, 1,
       1, 3, 2, 1, 1, 3, 1, 1, 1, 2, 2, 2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1,
       2, 3, 2, 3, 3, 1, 1, 2, 3, 1, 1, 1, 3, 1, 1, 3, 2, 3, 1, 1, 2, 3,
       3, 3, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 1, 3, 3, 1, 1, 1, 3, 3, 1, 1,
       1, 2, 3, 1, 3, 2, 1, 2, 3, 3, 2, 1, 3, 2, 2, 3, 2, 1, 1, 3, 1, 1,
       2, 3, 1, 1, 1, 3, 3, 2, 1, 3, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2, 1,
       3, 3, 2, 3, 1, 1, 1, 3, 3, 1, 1, 2, 1, 3, 2, 1, 1, 2, 3, 2, 1, 2,
       2, 3, 1, 1, 1, 1, 2, 3, 3, 3, 1, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 2,
       3, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 3, 1, 3, 3, 2, 3, 2,
       2, 3, 3, 1, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 3, 1, 3, 3, 3, 3,
       2, 1, 1, 3, 1, 2, 2, 3, 2, 3, 1, 2, 1, 3, 2, 2, 2, 3, 2, 3, 2, 1,
       3, 1, 3, 3, 2, 3, 2, 1, 2, 3, 1, 2, 1, 2, 2, 1, 3, 1, 3, 2, 3, 2,
       3, 1, 2, 3, 3, 1, 3, 1, 2, 3, 1, 3, 2, 3, 2, 2, 3, 3, 1, 1, 2, 2,
       1, 1, 3, 2, 3, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 3, 1, 2, 3,
       1, 3, 3, 1, 1, 1, 3, 1, 3, 3, 1, 1, 1, 2, 1, 2, 3, 1, 2, 3, 2, 3,
       2, 1, 2, 3, 2, 2, 2, 2, 1, 3, 3, 3, 1, 1, 3, 1, 1, 3, 2, 2, 1, 2,
       3, 1, 1, 3, 3, 2, 1, 1, 1, 3, 1, 2, 1, 1, 3, 3, 1, 3, 1, 2, 3, 2,
       1, 1, 2, 3, 2, 1, 1, 2, 3, 3, 2, 2, 1, 3, 3, 1, 3, 3, 1, 3, 2, 3,
       3, 3, 3, 1, 1, 1, 3, 1, 2, 1, 1, 2, 1, 1, 3, 3, 2, 2, 1, 3, 3, 1,
       3, 2, 3, 2, 2, 2, 3, 1, 2, 3], dtype=int32)
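
To see how well the three clusters line up with the actual vehicle types, here is a short sketch (an addition, not in the original run) that cross-tabulates the cluster labels against the class column:

In [ ]:
# Rows: hierarchical cluster labels; columns: true vehicle classes
print(pd.crosstab(clusters, vehicle_df_na_removed['class'],
                  rownames=['cluster'], colnames=['class']))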
In [62]:
# plt.figure(figsize=(10, 8))
plt.scatter(interest_df_z['compactness'], interest_df_z['circularity'], c=clusters)  # plot points with cluster dependent colors
plt.show()
In [63]:
# Hierarchical clustering on the PCA-transformed data, for comparison
# Generate the linkage matrix
Z_PCA = linkage(interest_df_z_pca, 'ward', metric='euclidean')
Z_PCA.shape
Out[63]:
(845, 4)
In [64]:
Z_PCA[:]
Out[64]:
array([[1.37000000e+02, 3.99000000e+02, 1.21814285e-01, 2.00000000e+00],
       [5.10000000e+02, 7.85000000e+02, 2.69779023e-01, 2.00000000e+00],
       [2.66000000e+02, 5.80000000e+02, 3.65581273e-01, 2.00000000e+00],
       ...,
       [1.67000000e+03, 1.68500000e+03, 4.72315762e+01, 2.59000000e+02],
       [1.68600000e+03, 1.68700000e+03, 5.44984892e+01, 5.87000000e+02],
       [1.68800000e+03, 1.68900000e+03, 1.09112070e+02, 8.46000000e+02]])
In [65]:
plt.figure(figsize=(25, 10))
dendrogram(Z_PCA)
plt.show()
In [66]:
dendrogram(
    Z_PCA,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [67]:
max_d = 50
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_PCA, max_d, criterion='distance')
clusters
Out[67]:
array([3, 3, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 1, 3, 2, 1, 1, 3, 3,
       2, 2, 1, 3, 3, 1, 1, 3, 3, 2, 2, 1, 2, 2, 3, 1, 1, 3, 1, 3, 3, 3,
       1, 3, 3, 3, 3, 2, 3, 3, 1, 2, 1, 2, 2, 3, 1, 3, 1, 3, 3, 3, 2, 3,
       3, 1, 3, 1, 1, 1, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 2, 3, 3, 1, 3, 3,
       2, 3, 1, 3, 1, 3, 3, 1, 3, 3, 1, 3, 1, 2, 2, 3, 2, 1, 1, 3, 3, 2,
       3, 3, 3, 3, 3, 2, 1, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2, 1, 1, 2,
       2, 3, 1, 1, 3, 2, 3, 3, 2, 3, 1, 3, 2, 1, 2, 3, 2, 2, 1, 2, 2, 1,
       2, 1, 2, 3, 3, 2, 3, 1, 2, 3, 1, 1, 2, 1, 3, 3, 1, 1, 2, 1, 3, 3,
       2, 3, 2, 3, 1, 3, 2, 3, 1, 2, 2, 2, 1, 3, 1, 3, 2, 1, 3, 3, 1, 3,
       3, 3, 3, 2, 1, 1, 2, 3, 2, 3, 3, 1, 2, 3, 3, 1, 3, 3, 3, 1, 3, 2,
       1, 3, 1, 3, 3, 2, 1, 2, 1, 3, 3, 3, 3, 1, 2, 3, 2, 3, 1, 3, 2, 2,
       3, 1, 3, 3, 2, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 2, 1, 1, 3, 2, 3, 2,
       1, 3, 3, 2, 2, 3, 3, 2, 2, 2, 1, 3, 3, 3, 1, 2, 2, 3, 3, 1, 3, 2,
       3, 3, 1, 3, 3, 1, 3, 2, 1, 2, 1, 3, 2, 2, 1, 2, 2, 2, 3, 3, 1, 1,
       1, 1, 2, 3, 3, 1, 3, 3, 3, 2, 3, 1, 1, 1, 3, 1, 2, 3, 1, 3, 2, 3,
       2, 1, 1, 3, 1, 1, 3, 1, 2, 3, 3, 3, 3, 1, 1, 1, 1, 3, 2, 2, 2, 3,
       2, 3, 2, 2, 3, 1, 3, 1, 1, 1, 2, 2, 3, 1, 3, 3, 3, 2, 2, 2, 2, 2,
       3, 1, 1, 3, 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 1, 3, 3, 2, 3, 1, 3,
       1, 3, 2, 2, 1, 3, 1, 2, 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 2, 2,
       3, 2, 3, 1, 2, 3, 2, 2, 1, 3, 1, 2, 1, 1, 3, 3, 1, 2, 3, 3, 2, 1,
       1, 3, 2, 1, 1, 3, 1, 1, 1, 2, 2, 3, 2, 2, 1, 3, 3, 3, 1, 2, 2, 1,
       2, 3, 2, 3, 3, 1, 1, 2, 3, 1, 1, 1, 3, 1, 1, 3, 2, 3, 1, 1, 2, 3,
       3, 3, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 1, 3, 3, 1, 1, 1, 3, 3, 2, 1,
       1, 3, 3, 1, 3, 3, 1, 2, 3, 3, 2, 1, 3, 2, 3, 3, 2, 1, 1, 2, 1, 1,
       2, 3, 2, 1, 1, 3, 3, 2, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 3, 1,
       3, 3, 3, 3, 1, 2, 1, 3, 3, 1, 1, 3, 1, 3, 3, 2, 1, 3, 3, 2, 1, 2,
       2, 3, 1, 1, 1, 1, 2, 3, 3, 3, 1, 1, 1, 2, 1, 3, 3, 1, 3, 3, 3, 2,
       3, 1, 3, 3, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 3, 3, 1, 3, 3, 2, 3, 2,
       2, 3, 3, 1, 1, 3, 2, 1, 3, 1, 3, 3, 1, 2, 3, 1, 3, 1, 3, 3, 2, 3,
       2, 1, 1, 3, 2, 2, 2, 3, 2, 3, 1, 3, 1, 3, 2, 2, 2, 3, 3, 3, 2, 1,
       2, 2, 3, 2, 2, 3, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 3, 1, 3, 3, 3, 2,
       3, 1, 2, 3, 2, 1, 3, 1, 2, 3, 1, 3, 2, 3, 3, 2, 3, 2, 1, 1, 2, 2,
       1, 1, 3, 2, 3, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 3, 2, 2, 3,
       1, 3, 3, 1, 1, 1, 3, 1, 3, 3, 1, 1, 1, 2, 1, 3, 2, 1, 2, 3, 3, 3,
       2, 1, 2, 3, 2, 2, 2, 3, 1, 3, 3, 3, 2, 1, 3, 1, 1, 3, 2, 3, 1, 2,
       3, 1, 1, 3, 3, 2, 2, 1, 1, 3, 1, 2, 2, 1, 3, 3, 1, 3, 1, 3, 3, 2,
       1, 2, 3, 3, 3, 1, 1, 3, 3, 3, 2, 2, 1, 3, 3, 1, 3, 3, 1, 3, 2, 3,
       3, 3, 3, 1, 2, 2, 3, 1, 2, 1, 1, 3, 2, 1, 3, 3, 2, 2, 1, 2, 3, 1,
       3, 2, 3, 2, 2, 3, 3, 1, 3, 3], dtype=int32)
In [68]:
# plt.figure(figsize=(10, 8))
plt.scatter(interest_df_z_pca[:,0], interest_df_z_pca[:,1], c=clusters)  # plot points with cluster dependent colors
plt.show()
In [ ]: